Specifying Document Structure: Differences in LaTeX and TEI Markup

Author

C. M. Sperberg-McQueen

Abstract

LaTeX and the Standard Generalized Markup Language (SGML), specifically the SGML tag set created by the Text Encoding Initiative (TEI), are two major systems developed to make it easier to create and verify valid documents. Each attempts to specify and enforce explicit definitions of valid textual structures; each faces questions regarding the structural components of texts, as well as the choice of abstract structures for representing them and of formal notations for specifying them. This paper focuses on the ways LaTeX and the TEI identify and classify the structural and other components of text; discusses the models of text underlying the two systems and the methods of text definition and validation they make possible; describes a number of specific issues that arise; considers some systematic differences; and describes one possible way in which they might coexist.

TUGboat, Volume 12 (1991), No. 3, Proceedings of the 1991 Annual Meeting

Introduction

As mechanical processing of text becomes easier, it also becomes easier, and more important, to specify formally what a text is and to use that specification to ensure the validity of the data stream that represents the text in the machine. Validation becomes important because application software uses increasingly complex data structures for text representation, and because our mechanical processing can destroy or corrupt data with an efficiency and thoroughness that far exceed the wildest dreams of the most assiduous scholar working by hand. Validation has become easier because computer science has provided a rich set of data structures to use in representing texts and increasingly sophisticated notations for specifying the valid forms of those data structures.

Today I want to discuss the specification of document structure in LaTeX and in the SGML tag set defined by the ACH/ACL/ALLC Text Encoding Initiative (TEI), an international effort to define an application-independent, language-independent, system-independent markup language for general use (especially in research). This has four parts. First, I'll discuss the substantive questions of what the structural components of texts are and, second, the methodological questions of choosing abstract structures with which to represent texts and formal notations with which to specify the abstract structures. Third, I'll describe briefly a number of concrete problems in the proper application of such abstract structures and formal notations to preexisting texts of the sort studied by most textual scholars, and, finally, I'll describe how I think SGML and LaTeX can usefully coexist in practice.

Any text-encoding scheme must provide ways to represent the characters of a text, its basic structure, intrinsic features other than structure, and extrinsic information associated with the text by an annotator. I am here concerned not with the first of these, but only with the other three.

Substantive Issues: What Belongs in a Text?

Basic text structure. On the basic structural components of text, there is a rather surprising agreement among the various markup languages in current use, at least among those which attempt to assign structure to texts. LaTeX implicitly divides a text into a title page (created by the \maketitle command, which must be preceded by author, document title, and similar information), followed by the text body and, optionally, by back matter (marked with the \appendix command). The body and back matter comprise either undivided text or a series of \parts or \chapters. Within parts, there is a straightforward hierarchy of chapter, section, subsection, subsubsection, paragraph, and subparagraph, in which the hierarchical relationships are enforced automatically.

The TEI tag set similarly divides documents into front matter (which can contain more than the title page), body, and back matter, with body and the parts of the front and back matter all divided into hierarchically nested blocks of text. Since existing (historical) texts may use structural units with names other than chapter, etc., TEI uses the generic term div for these blocks of text: the text body is a series of <div1>s, divided into <div2>s, divided into <div3>s, etc. The user can specify what name should be associated with a given level by giving the name as the value of an SGML attribute on the tag. The current draft stops at a fixed maximum nesting level, but this is a purely arbitrary decision and can be changed. An alternative proposal (used in some existing SGML tag sets) is to eliminate the redundant nesting-level numbers and replace the numbered tags with a single unnumbered tag. Since the nesting level can be readily calculated at processing time, blocks at different levels can still be processed differently. This is elegant but complicates life for whoever is specifying the processing.

Lower-level floating structures. Within the main structural divisions of the document, text is divided into paragraphs, and these have no visible internal formal structure. There are some chunks of text, however, that do have visible internal structure; these I call crystals, borrowing a term from Steven J. DeRose (in a TEI working paper). Crystals are internally structured free-floating units of text, such as figures, tables, or bibliographic citations. Leslie Lamport calls (some of) them floating bodies. LaTeX and the TEI recognize roughly the same set of large-scale crystals: lists, verbatim examples, displayed equations, figures, tables, and bibliographic references. The TEI further expects to provide tags for marking much smaller crystal structures like dates, addresses, personal and corporate names, and so on.
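As a rough illustration of the remark above that the nesting level of unnumbered division tags can be calculated at processing time, here is a minimal Python sketch. It makes several assumptions that are not in the paper: the sample is written as XML rather than SGML so that the standard library can parse it, and the generic tag name `div`, the attribute name `type`, and the function `div_levels` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: unnumbered <div> blocks, written as XML so the
# standard library can parse it (an assumption; the paper discusses SGML).
SAMPLE = """
<body>
  <div type="chapter">
    <div type="section">
      <div type="subsection"/>
    </div>
    <div type="section"/>
  </div>
  <div type="chapter"/>
</body>
"""


def div_levels(element, level=0):
    """Yield (level, type) for every <div>, deriving the level from nesting."""
    for child in element:
        if child.tag == "div":
            yield (level + 1, child.get("type"))
            yield from div_levels(child, level + 1)
        else:
            # Elements other than <div> do not add a division level.
            yield from div_levels(child, level)


levels = list(div_levels(ET.fromstring(SAMPLE)))
for depth, kind in levels:
    print("  " * (depth - 1) + f"level {depth}: {kind}")
```

The extra bookkeeping in `div_levels` is the kind of cost that falls on whoever specifies the processing: the markup is simpler, but the processor has to track depth itself.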
The TEI's plan to tag small crystals such as names and dates reflects a major difference between LaTeX and the TEI: LaTeX does not need special markup for addresses or personal names, because these do not typically require special treatment in document layout. The closest LaTeX gets is the convention used by BibTeX to distinguish first names from last names based on where one puts the comma. The TEI is not exclusively or primarily concerned with producing hard copy from documents, but with making it possible to mark the documents' logical structure in support of whatever kind of processing the user might want to do. Historians, librarians, office-automation people, and others may all want special processing based on the internal structure of names and dates: not for printing, perhaps, but for indexing or other reasons.

For the converse reason, the TEI has not yet made any concerted attempt to provide yet another language for the description of mathematical equations, figures, graphics, or tables. LaTeX, being concerned with processing for output (as well as with the logical structure of the text), can hardly get by without providing markup for such crystals. The TEI has thus far exploited a feature of SGML that allows sections of the text to be marked up in non-SGML notations so they can be processed by some appropriate processor. This keeps SGML out of the graphics-standards wars and allows designers of SGML tag sets to stay out, too. Although tables often have a clear logical structure, and it would make sense to attempt to capture this in descriptive markup, the TEI has yet to make any concrete recommendations in this area; this is an area of ongoing work.

For bibliographic citations, the TEI provides a structured form patterned on the standard forms for bibliographic references developed by librarians, as well as a much less tightly structured form for those with less concern about database usage of their citations.
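The comma convention mentioned above for BibTeX names can be sketched in a few lines of Python. This is a deliberately simplified illustration, not BibTeX's actual algorithm: it ignores particles such as "von" and suffixes such as "Jr", and the function name `split_name` is hypothetical.

```python
def split_name(name):
    """Return (first, last) using a simplified comma convention.

    With a comma, the text before it is taken as the last name.
    Without one, the final whitespace-separated word is the last name.
    """
    if "," in name:
        last, first = name.split(",", 1)
        return first.strip(), last.strip()
    parts = name.split()
    if not parts:
        return "", ""
    return " ".join(parts[:-1]), parts[-1]


examples = ["Sperberg-McQueen, C. M.", "Leslie Lamport", "Steven J. DeRose"]
for example in examples:
    print(split_name(example))
```

Even this small sketch shows why explicit tags for the parts of a name may be attractive for indexing: a positional convention only works as long as every name fits the expected pattern.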
The TEI's structured citation form provides more structure than appears to be available in the prose segments of LaTeX documents, but is less rich than the corresponding BibTeX structure. This is an area in which the TEI tags must definitely be extended to at least the level of detail offered by BibTeX.

Phrase-level attributes. Within the paragraph, the rigid hierarchical text structure of chapter, section, subsection, etc., suddenly breaks down, and we are confronted with a non-rigid mess with the consistency of soup. Within this soup, some of the larger chunks we've already discussed (crystals, like figures and tables) may be floating. Some nonstructured bits may be floating there as well: emphasized phrases, quotations, and the like. Here, LaTeX and the TEI take a very similar approach. Instead of describing the visual presentation of the text in a particular output medium, both encourage the user to describe its logical characteristics. Thus, LaTeX provides an \em command for emphasized text and suggests that the \bf, \sc, and similar commands "should appear not in the text but in the definitions of the commands that describe the logical structure." Similarly, the TEI provides several tags for marking words, phrases, or passages that are specially marked in some way:

Publication date: 2011
